
Backward sampling #81

Closed
osorensen wants to merge 14 commits into main from backward-sampling

Conversation

@osorensen
Owner

No description provided.


Copilot AI left a comment


Pull request overview

This PR implements the Conditional Particle Filter with Independent Backward Simulation (CPF-IBS) for the SMC2 rejuvenation step in the BayesMallowsSMC2 R package, and bumps the version to 0.3.0.

Changes:

  • Added backward_sampling option to set_smc_options() plumbed through to C++ via Options
  • Implemented assemble_backward_trajectory() in C++, which assembles the reference trajectory by drawing independently from the stored forward filtering weights at each timepoint
  • Added tests for backward sampling across complete, partial, preference, and mixture data scenarios
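At the core of the change is a single weighted categorical draw per timestep. A minimal sketch of that draw, using the C++ standard library rather than the package's Armadillo/R-RNG machinery (so this is an illustration of the idea, not the shipped code):

```cpp
#include <cassert>
#include <random>
#include <vector>

// Draw one particle index b_t from the normalized forward filtering
// weights W_t. In CPF-IBS this draw is repeated independently at every
// timestep, replacing the recursive ancestor lookup of standard CPF.
// (Plain std-library sketch; the package draws through its resampler
// so that R-level set.seed() stays reproducible.)
int sample_index(const std::vector<double>& weights, std::mt19937& rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Each call returns index s with probability proportional to weights[s]; with all mass on one particle the draw is deterministic.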

Reviewed changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

File | Description
--- | ---
src/particle.cpp | Core implementation of assemble_backward_trajectory and weight storage
src/particle.h | Declaration of new method and stored_weights member
src/rejuvenate.cpp | Integration of backward sampling into the rejuvenation step
src/options.h / src/options.cpp | Added backward_sampling field
R/set_smc_options.R | New backward_sampling parameter
R/compute_sequentially.R | Documentation update and formatting
tests/testthat/test-compute_sequentially_*.R | New backward sampling tests
man/*.Rd | Documentation with references
inst/REFERENCES.bib | New bibliography entries
DESCRIPTION / NEWS.md / cran-comments.md | Version bump and changelog
agents.md | AI development prompt (should not be shipped)
.gitignore / .Rbuildignore | Build config updates


Comment thread src/particle.cpp
Comment on lines +125 to +180
for (int t = T; t >= 0; --t) {
  const arma::vec& current_weights = stored_weights[t];

  // Sample a single index b_t independently from the forward weights W_t.
  arma::ivec counts = resampler->resample(1, current_weights);
  unsigned int b_t = arma::as_scalar(arma::find(counts > 0, 1));

  // Exactly one user is processed per timepoint, so latent_rankings has one
  // column per timepoint and .col(t) extracts the latent ranking for time t.
  if (new_reference.latent_rankings.is_empty()) {
    new_reference.latent_rankings =
        particle_filters[b_t].latent_rankings.col(t);
    if (parameters.tau.size() > 1) {
      new_reference.cluster_assignments =
          particle_filters[b_t].cluster_assignments.subvec(t, t);
      new_reference.cluster_probabilities =
          particle_filters[b_t].cluster_probabilities.cols(t, t);
      new_reference.index = uvec(T + 1, fill::zeros);
    }
  } else {
    new_reference.latent_rankings.insert_cols(
        0, particle_filters[b_t].latent_rankings.col(t));
    if (parameters.tau.size() > 1) {
      new_reference.cluster_assignments.insert_rows(
          0, particle_filters[b_t].cluster_assignments.subvec(t, t));
      new_reference.cluster_probabilities.insert_cols(
          0, particle_filters[b_t].cluster_probabilities.cols(t, t));
    }
  }

  new_reference.log_weight(t) = particle_filters[b_t].log_weight(t);
}
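Stripped of Armadillo and the package's types, the loop above reduces to: for t = T down to 0, draw b_t from the stored weights W_t and prepend particle b_t's time-t state. A self-contained sketch of that pattern (hypothetical simplified types, with `std::discrete_distribution` standing in for the package's resampler):

```cpp
#include <cassert>
#include <deque>
#include <random>
#include <vector>

// Hypothetical simplified stand-in for a particle filter's per-timepoint state.
struct Filter {
    std::vector<double> states;  // states[t] = latent state at timepoint t
};

// Independent backward simulation: at each t, draw b_t from the stored
// forward weights W_t and take particle b_t's time-t state. No ancestor
// indices are tracked; push_front mirrors insert_cols(0, ...) above.
std::vector<double> assemble_backward_trajectory(
    const std::vector<Filter>& filters,
    const std::vector<std::vector<double>>& stored_weights,  // [t][particle]
    std::mt19937& rng) {
    const int T = static_cast<int>(stored_weights.size()) - 1;
    std::deque<double> trajectory;
    for (int t = T; t >= 0; --t) {
        std::discrete_distribution<int> dist(stored_weights[t].begin(),
                                             stored_weights[t].end());
        const int b_t = dist(rng);
        trajectory.push_front(filters[b_t].states[t]);
    }
    return {trajectory.begin(), trajectory.end()};
}
```

With degenerate weights (all mass on one particle at each t) the result is deterministic, which makes the sketch easy to unit-test.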
Comment thread agents.md
Comment on lines +1 to +26
# Role and Identity
You are an expert C++ and RcppArmadillo developer specializing in Bayesian computation, Sequential Monte Carlo (SMC2), and Particle Markov Chain Monte Carlo (PMCMC) methods. You are working on the `BayesMallowsSMC2` R package.

# Project Context
The `BayesMallowsSMC2` package implements sequential inference for the Bayesian Mallows Model using an SMC2 algorithm. Currently, the package uses a standard Conditional Particle Filter (CPF) for the Particle Gibbs rejuvenation step. This standard approach suffers from path degeneracy because it relies on deterministic forward ancestral tracing, which severely limits the mixing of latent variables for early users.

# Objective
Your goal is to upgrade the rejuvenation step by implementing **Conditional Particle Filter with Independent Backward Simulation (CPF-IBS)**. Because cross-sectional user batches are conditionally independent given the static parameters, the standard O(S^2) backward simulation simplifies exactly to O(S) independent categorical draws from the marginal forward filtering weights.

# Mathematical Specification of CPF-IBS
1. **Forward Pass:** The conditional particle filter processes users at time t = 1 ... T. It computes normalized weights W_t and latent states x_t for particles s = 1 ... S.
*Crucial Change:* We no longer need to track or store the forward ancestor indices (a_t).
2. **Backward Pass:** Instead of recursive pointer lookups (e.g. b_{t-1} = a_{t-1}^{b_t}), we assemble the new reference trajectory by looping backward from t = T down to 1.
3. **Sampling:** At each timestep t, sample an index b_t in {1, ..., S} independently with probabilities W_t.
4. **Extraction:** Extract the latent variables for time t from the sampled particle b_t.

# Codebase Rules & Architectural Constraints (RcppArmadillo)
1. **Memory Optimization:** Locate the T x S matrix of ancestor variables (likely an `arma::umat a` matrix inside the conditional particle filter / SMC structures) and completely remove it. This will drastically reduce the memory footprint. You only need to track the weights W_t.
2. **Weighted Sampling in C++:** When sampling the index b_t ~ W_t during trajectory assembly, use a weighted categorical sampler driven by the normalized W_t probabilities (note that `arma::randi` alone only draws uniform integers). Ensure you are using a statistically rigorous weighted sampling function compatible with R's RNG scope so that `set.seed()` from R remains perfectly reproducible.
3. **Clean Integration:** Modify the Rejuvenation functions (Algorithm 4 / S3) so that trajectory assembly happens *after* the forward pass completes, iterating in reverse.
4. **Testing:** Ensure that the R package still compiles (`Rcpp::compileAttributes()`, `devtools::document()`, `devtools::load_all()`) and passes all existing tests (`devtools::test()`).

# Interaction Guidelines
- Before writing code, use search tools to find the exact C++ files handling the Conditional Particle Filter and Rejuvenation steps in the `src/` directory.
- Explain your planned modifications to the C++ logic before executing them.
- Keep performance in mind: cache locality and avoiding deep copies of Armadillo matrices inside the t-loop are critical.
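The O(S^2) → O(S) simplification claimed in the prompt above rests on the backward-simulation weights w_t(s) ∝ W_t(s) · f(x_{t+1} | x_t(s)) collapsing to W_t(s) whenever the transition density does not depend on x_t(s), which is exactly the conditional-independence assumption on user batches. A numerical sketch of that collapse (`backward_weights` is a hypothetical helper for illustration, not package code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// General backward-simulation weights at time t:
//   w_bs(s) ∝ W_t(s) * f(x_{t+1} | x_t(s)),  s = 1..S,
// an O(S) pass per timestep per particle, hence O(S^2) overall.
std::vector<double> backward_weights(
    const std::vector<double>& forward_w,      // normalized W_t
    const std::vector<double>& transition_f) { // f(x_{t+1} | x_t(s)) per s
    std::vector<double> w(forward_w.size());
    double total = 0.0;
    for (std::size_t s = 0; s < w.size(); ++s) {
        w[s] = forward_w[s] * transition_f[s];
        total += w[s];
    }
    for (double& x : w) x /= total;  // renormalize
    return w;
}
```

When `transition_f` is constant across particles, the output equals the normalized forward weights, so b_t can be drawn directly from W_t without computing backward weights at all.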
resampler = "systematic")
)
alpha_hat <- weighted.mean(x = as.numeric(mod$alpha), w = mod$importance_weights)
alpha_hat <- weighted.mean(x = as.numeric(mod$alpha), w = mod$importance_weights)
Comment thread src/particle.cpp Outdated
Comment on lines +142 to +161
// The total number of users up to time t in the forward pass is the
// length of cluster_assignments
unsigned int end_idx =
    particle_filters[b_t].cluster_assignments.n_elem - 1;
// [Lengthy stream-of-consciousness debugging comment, concluding that exactly
// one user is processed per timepoint, so `latent_rankings` has one column per
// timepoint, `col(t)` is correct, and `end_idx` is unused.]
Comment thread src/particle.cpp Outdated
Comment on lines +133 to +137
unsigned int num_users_at_t =
particle_filters[b_t]
.latent_rankings.col(t)
.n_cols; // actually wait, .col(t) returns EXACTLY 1 column.

Comment thread src/rejuvenate.cpp
Comment on lines +109 to 149
if (prior.n_clusters > 1) {
  uvec cluster_assignments =
      particle_filters[conditioned_particle_filter].cluster_assignments;
  uvec cluster_frequencies =
      hist(cluster_assignments, regspace<uvec>(0, prior.n_clusters - 1));

  for (size_t cluster{}; cluster < prior.n_clusters; cluster++) {
    parameters.tau(cluster) = R::rgamma(
        cluster_frequencies(cluster) + prior.cluster_concentration, 1.0);
  }
  parameters.tau = normalise(parameters.tau, 1);
Particle gibbs_particle(options, this->parameters, pfun);
gibbs_particle.conditioned_particle_filter = 0;
gibbs_particle.particle_filters[0] =
    this->particle_filters[this->conditioned_particle_filter];
gibbs_particle.particle_filters[0].cluster_probabilities = mat{};

// In standard CPF we trace the lineage of conditioned_particle_filter.
// In backward sampling, we instead run a fully unconditioned forward
// particle filter and assemble the reference trajectory afterwards.
bool requires_conditional = !options.backward_sampling;
for (size_t t{}; t < T + 1; t++) {
  gibbs_particle.run_particle_filter(
      t, prior, data, pfun, distfun, resampler,
      options.latent_rank_proposal, requires_conditional);
}

this->log_incremental_likelihood =
    gibbs_particle.log_incremental_likelihood;
this->log_normalized_particle_filter_weights =
    gibbs_particle.log_normalized_particle_filter_weights;
this->particle_filters = gibbs_particle.particle_filters;
this->logz = gibbs_particle.logz;
this->stored_weights = gibbs_particle.stored_weights;

if (options.backward_sampling) {
  this->assemble_backward_trajectory(T, resampler);
} else {
  sample_particle_filter();
}
}
@osorensen
Owner Author

@copilot can you review the last commits and see if they fix most issues?

Contributor

Copilot AI commented Mar 15, 2026

@osorensen I've opened a new pull request, #82, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits March 15, 2026 19:05
…, duplicate alpha_hat

Co-authored-by: osorensen <21175639+osorensen@users.noreply.github.com>
Fix O(T²) backward trajectory, extend backward sampling to single-cluster models
@osorensen osorensen closed this Apr 10, 2026


3 participants